Given a finite collection of estimators or classifiers, we study the 
problem of model selection type aggregation, that is, we construct a 
new estimator or classifier, called aggregate, which is nearly as good 
as the best among them with respect to a given risk criterion. We 
define our aggregate by a simple recursive procedure which solves 
an auxiliary stochastic linear programming problem related to the 
original nonlinear one and constitutes a special case of the mirror
averaging algorithm. We show that the aggregate satisfies sharp oracle
inequalities under some general assumptions. The results are applied 
to several problems including regression, classification and density 
estimation. 

1. Introduction. Several problems in statistics and machine learning can 
be stated as follows: given a collection of M estimators, construct a new 
estimator which is nearly as good as the best among them with respect 
to a given risk criterion. This target is called model selection (MS) type 
aggregation, and it can be described in terms of the following stochastic 
optimization problem. 

Let $(\mathcal{Z}, \mathcal{F})$ be a measurable space and let $\Theta$ be the simplex

$$\Theta = \Big\{\theta \in \mathbb{R}^M : \sum_{j=1}^M \theta^{(j)} = 1,\ \theta^{(j)} \ge 0,\ j = 1,\dots,M\Big\}.$$

Here and throughout the paper we suppose that $M \ge 2$ and we denote by
$z^{(j)}$ the $j$th component of a vector $z \in \mathbb{R}^M$. We denote by
$[z^{(j)}]_{j=1}^M$ the vector $z = (z^{(1)},\dots,z^{(M)})^\top \in \mathbb{R}^M$.

Let $Z$ be a random variable with values in $\mathcal{Z}$. The distribution of $Z$ is
denoted by $P$ and the corresponding expectation by $E$. Suppose that $P$
is unknown and that we observe $n$ i.i.d. random variables $Z_1,\dots,Z_n$ with
values in $\mathcal{Z}$ having the same distribution as $Z$. We denote by $P_n$ the joint
distribution of $Z_1,\dots,Z_n$ and by $E_n$ the corresponding expectation.

Consider a measurable function $Q : \mathcal{Z} \times \Theta \to \mathbb{R}$ and the corresponding
average risk function

$$A(\theta) = E\,Q(Z,\theta),$$

assuming that this expectation exists for all $\theta \in \Theta$. Stochastic optimization
problems that are usually studied in this context consist in minimization of
$A$ on some subsets of $\Theta$, given the sample $Z_1,\dots,Z_n$. Note that since the
distribution of $Z$ is unknown, direct (deterministic) minimization of $A$ is not
possible.



For $j \in \{1,\dots,M\}$, denote by $e_j$ the $j$th coordinate unit vector in $\mathbb{R}^M$:
$e_j = (0,\dots,0,1,0,\dots,0) \in \mathbb{R}^M$, where 1 appears in the $j$th position.

The aim of MS aggregation is to "mimic the oracle" $\min_j A(e_j)$, that is, to
construct an estimator $\hat\theta_n$ measurable with respect to $Z_1,\dots,Z_n$ and called
aggregate, such that

$$E_n A(\hat\theta_n) \le \min_{1\le j\le M} A(e_j) + \Delta_{n,M}, \tag{1.1}$$

where $\Delta_{n,M} > 0$ is a remainder term that should be as small as possible.
Thus, the stochastic optimization problem associated to MS aggregation is

$$\min_{\theta \in \{e_1,\dots,e_M\}} A(\theta).$$

As an example, one may consider the loss function of the form
$Q(z,\theta) = \ell(z,\theta^\top H)$ where $\ell : \mathcal{Z}\times\mathbb{R}\to\mathbb{R}$ and
$H = (h_1,\dots,h_M)^\top$ is a vector of preliminary estimators (classifiers)
constructed from a training sample which is supposed to be frozen in our
considerations (thus, $h_j$ can be viewed as fixed functions). The value
$A(e_j) = E\,\ell(Z,h_j)$ is the risk corresponding to $h_j$. Inequality (1.1) can
then be interpreted as follows: the aggregate $\hat\theta_n^\top H$, that is, the convex
combination of initial estimators (classifiers) $h_j$, with the vector of
mixture coefficients $\hat\theta_n$ measurable with respect to $Z_1,\dots,Z_n$, is
nearly as good as the best among $h_1,\dots,h_M$. The word "nearly" here means
that the value $\min_j A(e_j)$ is reproduced up to a reasonably small remainder
term $\Delta_{n,M}$. Lower bounds can be established showing that, under some
assumptions, the smallest possible value of $\Delta_{n,M}$ in a minimax sense has the
form

$$\Delta_{n,M} = \frac{C\log M}{n} \tag{1.2}$$

with some constant $C > 0$; cf. [24].

Besides being in themselves precise finite sample results, oracle inequalities
of the type (1.1) are very useful in adaptive nonparametric estimation.
They allow one to prove that the aggregate estimator $\hat\theta_n^\top H$ is adaptive in
a minimax asymptotic sense (and even sharp minimax adaptive in several
cases; for more discussion see, e.g., [18]).

The aim of this paper is to obtain bounds of the form (1.1)-(1.2) under 
some general conditions on the loss function Q. For two special cases [density 
estimation with the Kullback-Leibler (KL) loss, and regression model with 
squared loss] such bounds have been proved earlier in the works of Catoni 
[7, 8, 9] and Yang [29]. They independently obtained the bound for density 
estimation with the KL loss, and Catoni [8, 9] solved the problem for the 
regression model with squared loss. Bunea and Nobel [5] improved the re- 
gression with squared loss result of [8, 9] in the case of bounded response, 
and obtained some related inequalities under weaker conditions. For a prob- 
lem which is different but close to ours (MS aggregation in the Gaussian 
white noise model with squared loss) Nemirovski [18], page 226, established 
an inequality similar to (1.1), with a suboptimal remainder term. Leung 
and Barron [15] improved upon this result to achieve the optimal remainder 
term. 

Several other works provided less precise bounds than (1.1)-(1.2), with
$K \min_j A(e_j)$ where the leading constant $K > 1$, instead of $\min_j A(e_j)$ in
(1.1) and with a remainder term which is sometimes larger than the optimal
one (1.2); a detailed account can be found in the survey [4] or in the lecture
notes [17]. We mention here only some recent work where aggregation of
arbitrary estimators is considered: [1, 6, 16, 22, 28, 30]. These results are
useful for statistical applications, especially if the leading constant $K$ is close
to 1. However, the inequalities with $K > 1$ do not provide valid bounds for
the excess risk $E_n A(\hat\theta_n) - \min_j A(e_j)$, that is, they do not show that $\hat\theta_n$
approximately solves the stochastic optimization problem.

Below we study the mirror averaging MS aggregate $\hat\theta_n$ which is defined by
a simple recursive procedure (cf. Section 3). This procedure outputs a convex
mixture of initial estimators. Before defining the procedure, we give some
arguments in favor of considering mixtures rather than selectors. Selectors
are estimators with values in $\{e_1,\dots,e_M\}$, for example, minimizers of the
empirical risk. In Proposition 2.1 we show that selectors cannot satisfy
(1.1)-(1.2), even for the simplest case where the loss function $Q$ is quadratic.
The main results of the paper are given in Section 4; there we prove that
the suggested mirror averaging aggregate satisfies oracle inequalities
(1.1)-(1.2) under some general assumptions on $Q$. Finally, we show in Section 5
that these assumptions are fulfilled for several statistical models including
regression, classification and density estimation.

2. Suboptimality of selectors. Recall that our goal is to construct an
estimator $\hat\theta_n$ that satisfies an oracle inequality of the type (1.1). A traditional
way to approach this problem is based on empirical risk minimization. Define
the empirical risk $A_n$ by

$$A_n(\theta) = \frac{1}{n}\sum_{i=1}^n Q(Z_i,\theta)$$

and the empirical risk minimizer (ERM) by

$$\tilde\theta_n = \operatorname*{argmin}_{\theta\in\{e_1,\dots,e_M\}} A_n(\theta).$$

Clearly, the ERM selects one of the $M$ initial estimators. More generally, we
call a selector any estimator $T_n$ based on the sample $(Z_1,\dots,Z_n)$ having this
property, that is, such that $T_n$ takes values in $\{e_1,\dots,e_M\}$.
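As a minimal sketch of the ERM selector just defined, the Python snippet below assumes the losses have been precomputed into a hypothetical table `losses[i][j]` holding $Q(Z_{i+1}, e_j)$. The function name `erm_selector` and this data layout are illustrative assumptions, not part of the original text.

```python
def erm_selector(losses):
    """Return (j, e_j) for the vertex minimizing the empirical risk.

    losses[i][j] is assumed to hold Q(Z_{i+1}, e_j) for a sample of size
    n = len(losses) and M = len(losses[0]) initial estimators.
    """
    n = len(losses)
    M = len(losses[0])
    # Empirical risk A_n(e_j) = (1/n) * sum_i Q(Z_i, e_j) for each vertex.
    emp_risk = [sum(row[j] for row in losses) / n for j in range(M)]
    # Index of the smallest empirical risk (first index wins on ties).
    j_best = min(range(M), key=emp_risk.__getitem__)
    # A selector outputs a vertex of the simplex, never a proper mixture.
    vertex = [1.0 if j == j_best else 0.0 for j in range(M)]
    return j_best, vertex
```

The output is always one of the unit vectors, which is exactly the restriction that the lower bound of this section exploits.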

The following example shows that under the squared loss the rate of
convergence $\Delta_{n,M}$ in (1.1) for any selector $\hat\theta_n = T_n$ is not faster than
$\sqrt{(\log M)/n}$, which is substantially worse than the optimal rate given in (1.2).

Indeed, consider the squared loss

$$Q(z,\theta) = \tfrac{1}{2}\theta^\top\theta - z^\top\theta, \qquad z \in \mathbb{R}^M,\ \theta\in\Theta. \tag{2.1}$$

For $k = 1,\dots,M$ denote by $P_k$ the distribution of a Gaussian random vector
$Z \in \mathbb{R}^M$ with mean $e_k(\sigma/2)\sqrt{(\log M)/n}$ and covariance matrix $\sigma^2 I$,
where $I$ stands for the identity matrix, and denote by $E_k$ the corresponding
expectation. It is easy to see that the risk $A_k(\cdot) = E_k[Q(Z,\cdot)]$ satisfies

$$A_k(e_k) = \frac12 - \frac{\sigma}{2}\sqrt{\frac{\log M}{n}}, \qquad A_k(e_j) = \frac12, \quad j \ne k. \tag{2.2}$$

Therefore $A_k$ admits a unique minimum over the set of vertices $\{e_1,\dots,e_M\}$
and the minimum is attained at $e_k$.
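For completeness, (2.2) follows from a one-line computation. The sketch below uses the shorthand $\mu_k = e_k(\sigma/2)\sqrt{(\log M)/n}$ for the mean of $P_k$, a symbol introduced only for this illustration:

$$A_k(\theta) = E_k\big[\tfrac12\theta^\top\theta - Z^\top\theta\big] = \tfrac12\|\theta\|^2 - \mu_k^\top\theta, \qquad A_k(e_j) = \frac12 - \frac{\sigma}{2}\sqrt{\frac{\log M}{n}}\,\mathbf{1}_{\{j=k\}},$$

since $\|e_j\|^2 = 1$ and $\mu_k^\top e_j$ vanishes unless $j = k$.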

Proposition 2.1. Let $Q$ be the squared loss function (2.1). Assume that
we observe i.i.d. random vectors $Z_1,\dots,Z_n$ with the same distribution as $Z$.
Denote by $E_n^k$ the expectation with respect to the sample $Z_1,\dots,Z_n$ when $Z$
has distribution $P_k$. Then there exists an absolute constant $c > 0$ such that

$$\inf_{T_n}\ \sup_{k=1,\dots,M}\Big\{E_n^k[A_k(T_n)] - \min_{1\le j\le M} A_k(e_j)\Big\} \ge c\,\sigma\sqrt{\frac{\log M}{n}}, \tag{2.3}$$

where the infimum is taken over all the selectors $T_n$.

A weaker result of similar type [with the rate $1/\sqrt{n}$ instead of $\sqrt{(\log M)/n}$]
is given in [14]. Proposition 2.1 implies that the slow rate $\sqrt{(\log M)/n}$ is the
best attainable rate for selectors, since the standard ERM selector satisfies
the oracle inequality (1.1) with rate $\Delta_{n,M} \sim \sqrt{(\log M)/n}$. The proof of
Proposition 2.1 is given in Section 6.
The squared loss function (2.1) satisfies the assumptions of Theorems 4.1
and 4.2 below. As a consequence, the corresponding aggregated estimate $\hat\theta_n$,
provided by the algorithm of Section 3, attains the bound with fast rate
$(\log M)/n$:

$$E_n^k A_k(\hat\theta_n) \le \min_{1\le j\le M} A_k(e_j) + C\,\frac{\log M}{n} \qquad \forall k = 1,\dots,M,$$

for some constant $C > 0$. On the other hand, for the same squared loss,
Proposition 2.1 shows that a selector with values in $\{e_1,\dots,e_M\}$, in
particular the ERM, cannot satisfy an oracle inequality of the type (1.1)
with the rate faster than $\sqrt{(\log M)/n}$. This observation suggests that
extending the set of possible values of the estimator to the whole simplex
$\Theta$ may help to obtain faster rates of aggregation.

3. The algorithm. Procedures with values in $\Theta$, that is, convex mixtures
of the initial estimators, can be constructed in various ways. One of them
originates from the idea of mirror descent due to Nemirovski and Yudin [19].
This idea has been further developed in [3, 20], mainly in the deterministic
optimization framework. A version of the mirror descent method due to
Nesterov [20] has been applied to the aggregation problem in [12] under the
name of mirror averaging. As shown in [12], for convex loss functions $Q$ the
mirror averaging estimator $\hat\theta_n$ satisfies under mild assumptions the following
oracle inequality:

$$E_n A(\hat\theta_n) \le \min_{\theta\in\Theta} A(\theta) + C_0\sqrt{\frac{\log M}{n}}, \tag{3.1}$$

where $C_0 > 0$ is a constant depending only on the supremum norm of the
gradient $\nabla_\theta Q(\cdot,\cdot)$. The name mirror averaging reflects the fact that the
algorithm does a stochastic gradient descent in the dual space with further
"mirroring" to the primal space and averaging; for more details and
discussion see [12].

Note that in (3.1) the minimum is taken over the whole simplex $\Theta$, so an
inequality of the type (1.1) holds as well, but for large $n$ the remainder term
in (3.1) is of larger order than the optimal one given in (1.2).

To improve upon this, consider the following auxiliary stochastic linear
programming problem. If $A$ is a convex function, we can bound it from above
by a linear function:

$$A(\theta) \le \sum_{j=1}^M \theta^{(j)} A(e_j) =: \tilde A(\theta), \qquad \theta\in\Theta,$$

where $\tilde A(\theta) = E\,\tilde Q(Z,\theta)$, with

$$\tilde Q(Z,\theta) = \theta^\top u(Z), \qquad u(Z) = (Q(Z,e_1),\dots,Q(Z,e_M))^\top.$$

Note that

$$\tilde A(e_j) = A(e_j), \qquad j = 1,\dots,M.$$

Since $\Theta$ is a simplex, the minimum of the linear function $\tilde A$ is attained at
one of its vertices. Therefore,

$$\min_{\theta\in\Theta}\tilde A(\theta) = \min_{1\le j\le M} A(e_j),$$

which shows that the linear stochastic programming problem of minimization
of $\tilde A$ on $\Theta$ is linked to the problem of MS aggregation. This also suggests
that the mirror averaging algorithm of [12] applied to minimization of the
linear function $\tilde A$ could make sense to achieve our MS aggregation goal.
Particularizing the definition of the mirror averaging procedure from [12] for
the linear function $\tilde A$, we get the following algorithm.
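For $\beta > 0$ define the function $W_\beta : \mathbb{R}^M \to \mathbb{R}$ by

$$W_\beta(z) = \beta\log\Bigg(\frac{1}{M}\sum_{j=1}^M e^{-z^{(j)}/\beta}\Bigg), \qquad z = (z^{(1)},\dots,z^{(M)}). \tag{3.2}$$

The gradient of $W_\beta$ is given by

$$\nabla W_\beta(z) = \Bigg[-\frac{e^{-z^{(j)}/\beta}}{\sum_{k=1}^M e^{-z^{(k)}/\beta}}\Bigg]_{j=1}^M.$$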

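Consider the vector

$$u_i = (Q(Z_i,e_1),\dots,Q(Z_i,e_M))^\top = u(Z_i) = \nabla_\theta\tilde Q(Z_i,\theta),$$

and the iterations:

• Fix the initial values $\theta_0 \in \Theta$ and $\zeta_0 = 0 \in \mathbb{R}^M$.

• For $i = 1,\dots,n-1$, do the recursive update

$$\zeta_i = \zeta_{i-1} + u_i, \qquad \theta_i = -\nabla W_\beta(\zeta_i). \tag{3.3}$$

• Output at iteration $n$ the average

$$\hat\theta_n = \frac{1}{n}\sum_{i=1}^n \theta_{i-1}. \tag{3.4}$$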

Note that the estimator $\hat\theta_n$ is measurable with respect to $(Z_1,\dots,Z_{n-1})$.
The components $\theta_i^{(j)}$ of the vector $\theta_i$ from (3.3) can be written in the form

$$\theta_i^{(j)} = \frac{\exp\big(-\beta^{-1}\sum_{m=1}^{i} Q(Z_m,e_j)\big)}{\sum_{k=1}^M \exp\big(-\beta^{-1}\sum_{m=1}^{i} Q(Z_m,e_k)\big)}.$$
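The recursion (3.3)-(3.4) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the losses are assumed to be precomputed in a hypothetical table `losses[i][j]` holding $Q(Z_{i+1}, e_j)$, the initial point is taken to be the uniform vector (the value of the exponential-weights formula with an empty sum), and the names `exp_weights` and `mirror_averaging` are made up for this example.

```python
import math


def exp_weights(zeta, beta):
    """Exponential weights exp(-zeta_j / beta), normalized to sum to one."""
    shift = min(zeta)  # subtracting the minimum avoids overflow in exp
    w = [math.exp(-(z - shift) / beta) for z in zeta]
    total = sum(w)
    return [x / total for x in w]


def mirror_averaging(losses, beta):
    """Average of theta_0, ..., theta_{n-1} as in (3.4).

    losses[i][j] is assumed to hold Q(Z_{i+1}, e_j); theta_0 is assumed
    uniform, i.e. the exponential weights of zeta_0 = 0.
    """
    n = len(losses)
    M = len(losses[0])
    zeta = [0.0] * M        # dual variable zeta_0 = 0
    theta_sum = [0.0] * M
    for i in range(n):
        theta = exp_weights(zeta, beta)   # theta_i from zeta_i
        theta_sum = [s + t for s, t in zip(theta_sum, theta)]
        if i < n - 1:
            # zeta_{i+1} = zeta_i + u_{i+1}; the last observation is not used
            zeta = [z + u for z, u in zip(zeta, losses[i])]
    return [s / n for s in theta_sum]
```

Because only the first $n-1$ rows of `losses` enter the computation, the output does not depend on the last observation, in line with the measurability remark above.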
The "mirroring" function $\nabla W_\beta$ maps the variables $\zeta_i$ that take on values
in the dual space (which is $\mathbb{R}^M$ equipped with the $\ell_\infty$ norm) to the primal
space (which is the $\ell_1$ body $\Theta$); cf. [12]. Note that $W_\beta$ defined in (3.2) is
not the only possible choice; other functions $W_\beta$ satisfying the conditions
described in [12] can be used to construct the updates (3.3).

We arrived at the algorithm (3.3)-(3.4) by a linear stochastic programming
argument. It is interesting that several particular cases or versions of
this algorithm are well known, and they were derived from different consid- 
erations. We mention first the literature on prediction of individual deter- 
ministic sequences. For a detailed account on this subject see [10]. A general 
problem considered there is for an agent to compete against the observed 
predictions of a group of experts, so that the agent's error is close to that 
of the best expert. In that framework the observations Zi are supposed to 
be uniformly bounded nonrandom variables, and the risk function is defined 
as the cumulative loss over the trajectory. Interestingly, for such problems, 
which are quite different from ours, methods similar to (3.3) constitute one 
of the principal tools; cf. [11, 13, 23, 26, 27]. However, in contrast to our 
procedure, those methods do not involve the averaging step (3.4); they do 
not need it because they deal with non-random observations and cumula- 
tive losses. Note that the algorithm with the averaging step (3.4) included, 
that is, the one that we consider here, has also been discussed in the lit- 
erature, though only for two specific combinations of loss function/model: 
the squared loss Q in regression model [5, 8, 9] and the Kullback-Leibler 
loss Q in density estimation [7, 9, 29]. It is interesting that in the latter 
case the algorithm (3.3)-(3.4) can be derived using information-theoretical 
arguments from the theory of source coding; cf. [9]. 

Note that we define algorithm (3.3)-(3.4) for a general loss function 
Q, and we consider arbitrary i.i.d. data Zi, not restricted to a particular 
model. 

Since (3.3)-(3.4) is a special case of the mirror averaging method of [12]
corresponding to a linear function $\tilde A$, the coarse oracle inequality (3.1)
remains valid with $A$ replaced by $\tilde A$. But we show below that in fact $\hat\theta_n$
satisfies a stronger inequality, that is, one with the optimal rate (1.2).

4. Main results. In this section we prove two theorems. They establish
oracle inequalities of the type (1.1) for $\hat\theta_n$. Theorem 4.1 requires a more
conservative assumption on the loss functions $Q$ than Theorem 4.2. This
assumption is easier to check, and it often leads to a sharper bound but
not for such models as nonparametric density estimation with the $L_2$ loss
which will be treated using Theorem 4.2. In some cases (e.g., in regression
with Gaussian noise) Theorem 4.1 yields a suboptimal remainder term, while
Theorem 4.2 does the correct job. In both theorems it is supposed that the
values $A(e_1),\dots,A(e_M)$ are finite. We will also need the following definition.



Definition 4.1. A function $T : \mathbb{R}^M \to \mathbb{R}$ is called exponentially concave
if the composite function $\exp\circ T$ is concave.

It is straightforward to see that exponential concavity of a function $-T$
implies that $T$ is convex. Furthermore, if $-T/\beta$ is exponentially concave for
some $\beta > 0$, then $-T/\beta'$ is exponentially concave for all $\beta' > \beta$. Let $Q_1$ be
the function on $\mathcal{Z}\times\Theta\times\Theta$ defined by $Q_1(z,\theta,\theta') = Q(z,\theta) - Q(z,\theta')$ for all
$z\in\mathcal{Z}$ and all $\theta,\theta'\in\Theta$.
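Both claims about exponential concavity can be checked in one line each. The following is a sketch of the standard argument, with the exponent $a = \beta/\beta' \in (0,1]$ introduced here only for convenience:

$$-T = \log\big(\exp(-T)\big), \qquad \exp(-T/\beta') = \big(\exp(-T/\beta)\big)^{a}.$$

In the first identity, $\log$ is concave and nondecreasing, so it preserves concavity of the positive function $\exp(-T)$; hence $-T$ is concave and $T$ is convex. In the second, $x \mapsto x^{a}$ is concave and nondecreasing on $[0,\infty)$, so it preserves concavity of $\exp(-T/\beta)$.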

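Theorem 4.1. Assume that $Q_1$ can be decomposed into the sum of two
functions $Q_1 = Q_2 + Q_3$ such that:

• The mapping $\theta \mapsto -Q_2(z,\theta,\theta')/\beta$ is exponentially concave on the simplex
$\Theta$, for all $z\in\mathcal{Z}$, $\theta'\in\Theta$, and $Q_2(z,\theta,\theta) = 0$ for all $z\in\mathcal{Z}$, $\theta\in\Theta$.

• There exists a function $R$ on $\mathcal{Z}$ integrable with respect to $P$ and such that
$-Q_3(z,\theta,\theta') \le R(z)$, for all $z\in\mathcal{Z}$, $\theta,\theta'\in\Theta$.

Then the aggregate $\hat\theta_n$ satisfies, for any $M \ge 2$, $n \ge 1$, the following oracle
inequality:

$$E_{n-1}A(\hat\theta_n) \le \min_{1\le j\le M} A(e_j) + \frac{\beta\log M}{n} + E[R(Z)].$$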

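Theorem 4.2. Assume that for some $\beta > 0$ there exists a Borel function
$\Psi_\beta : \Theta\times\Theta \to \mathbb{R}_+$ such that the mapping $\theta \mapsto \Psi_\beta(\theta,\theta')$ is concave on the
simplex $\Theta$ for any fixed $\theta'\in\Theta$, $\Psi_\beta(\theta,\theta) = 1$ and
$E\exp(-Q_1(Z,\theta,\theta')/\beta) \le \Psi_\beta(\theta,\theta')$ for all $\theta,\theta'\in\Theta$. Then the aggregate
$\hat\theta_n$ satisfies, for any $M \ge 2$, $n \ge 1$, the following oracle inequality:

$$E_{n-1}A(\hat\theta_n) \le \min_{1\le j\le M} A(e_j) + \frac{\beta\log M}{n}.$$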

Proofs of both theorems are based on the following lemma. Introduce the
discrete random variable $\omega$ with values in the set $\{e_1,\dots,e_M\}$ and with the
distribution $\mathbb{P}$ defined conditionally on $(Z_1,\dots,Z_{n-1})$ by
$\mathbb{P}[\omega = e_j] = \hat\theta_n^{(j)}$, where $\hat\theta_n^{(j)}$ is the $j$th component of $\hat\theta_n$. The
expectation corresponding to $\mathbb{P}$ is denoted by $\mathbb{E}$.

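Lemma 4.1. For any measurable function $Q$ and any $\beta > 0$ we have

$$E_{n-1}A(\hat\theta_n) \le \min_{1\le j\le M} A(e_j) + \frac{\beta\log M}{n} + S_1, \tag{4.1}$$

where

$$S_1 = \beta E_n\log\Big(\mathbb{E}\exp\Big(-\frac{Q_1(Z_n,\omega,\mathbb{E}[\omega])}{\beta}\Big)\Big).$$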

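Proof. By definition of $W_\beta(\cdot)$, for $i = 1,\dots,n$,

$$W_\beta(\zeta_i) - W_\beta(\zeta_{i-1}) = \beta\log\Bigg(\frac{\sum_{j=1}^M e^{-\zeta_i^{(j)}/\beta}}{\sum_{j=1}^M e^{-\zeta_{i-1}^{(j)}/\beta}}\Bigg) = \beta\log\big(-v_i^\top\nabla W_\beta(\zeta_{i-1})\big) = \beta\log\big(v_i^\top\theta_{i-1}\big), \tag{4.2}$$

where

$$v_i = \Big[\exp\Big(-\frac{u_i^{(j)}}{\beta}\Big)\Big]_{j=1}^M.$$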
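Taking expectations on both sides of (4.2), summing up over $i$, using the
fact that $(\theta_{i-1},Z_i)$ has the same distribution as $(\theta_{i-1},Z_n)$ for $i = 1,\dots,n$,
and applying the Jensen inequality, we get

$$\begin{aligned}
\frac{E_n[W_\beta(\zeta_n) - W_\beta(\zeta_0)]}{n}
&= \frac{\beta}{n}\sum_{i=1}^n E_n\log\Bigg(\sum_{j=1}^M \theta_{i-1}^{(j)}\exp\Big(-\frac{Q(Z_i,e_j)}{\beta}\Big)\Bigg) \\
&= \frac{\beta}{n}\sum_{i=1}^n E_n\log\Bigg(\sum_{j=1}^M \theta_{i-1}^{(j)}\exp\Big(-\frac{Q(Z_n,e_j)}{\beta}\Big)\Bigg) \\
&\le \beta E_n\log\Bigg(\sum_{j=1}^M \hat\theta_n^{(j)}\exp\Big(-\frac{Q(Z_n,e_j)}{\beta}\Big)\Bigg) =: S.
\end{aligned} \tag{4.3}$$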



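Since $Q_1(z,\omega,\mathbb{E}[\omega]) = Q(z,\omega) - Q(z,\mathbb{E}[\omega])$ and $\mathbb{E}[\omega] = \hat\theta_n$, the RHS of (4.3)
can be written in the form

$$\begin{aligned}
S &= \beta E_n\log\Big(\mathbb{E}\exp\Big(-\frac{Q(Z_n,\omega)}{\beta}\Big)\Big) \\
&= \beta E_n\log\Big(\exp\Big(-\frac{Q(Z_n,\mathbb{E}[\omega])}{\beta}\Big)\Big) + S_1 \\
&= -E_{n-1}A(\hat\theta_n) + S_1.
\end{aligned} \tag{4.4}$$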



We now bound from below the LHS of (4.3). For any $j^* = 1,\dots,M$, by
monotonicity of the function $\log(\cdot)$, we have

$$W_\beta(\zeta_n) \ge \beta\log\Big(\frac{1}{M}e^{-\zeta_n^{(j^*)}/\beta}\Big) = -\beta\log M - \zeta_n^{(j^*)},$$

where $\zeta_n^{(j^*)}$ is the $j^*$th component of $\zeta_n$. Set $j^* = \operatorname{argmin}_{1\le j\le M} A(e_j)$.
Then, using the fact that $W_\beta(\zeta_0) = W_\beta(0) = 0$, we obtain

$$\frac{E_n[W_\beta(\zeta_n) - W_\beta(\zeta_0)]}{n} \ge -\frac{\beta\log M}{n} - \frac{E_n[\zeta_n^{(j^*)}]}{n} = -\frac{\beta\log M}{n} - \min_{1\le j\le M} A(e_j). \tag{4.5}$$

Combining (4.3), (4.4) and (4.5) gives the lemma. □
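Spelled out, the combination is a single chain of inequalities. The display below is a sketch that uses only (4.3), (4.4) and (4.5), with $S$ denoting the right-hand side of (4.3):

$$-\frac{\beta\log M}{n} - \min_{1\le j\le M} A(e_j) \le \frac{E_n[W_\beta(\zeta_n) - W_\beta(\zeta_0)]}{n} \le S = -E_{n-1}A(\hat\theta_n) + S_1.$$

Moving $E_{n-1}A(\hat\theta_n)$ to the left-hand side and the remaining terms to the right-hand side gives (4.1).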

In view of Lemma 4.1, to prove Theorems 4.1 and 4.2 it remains to give
appropriate upper bounds for $S_1$.

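Proof of Theorem 4.1. Since $Q_1 = Q_2 + Q_3$, with $-Q_3(z,\theta,\theta') \le R(z)$
for all $z\in\mathcal{Z}$, $\theta,\theta'\in\Theta$, the quantity $S_1$ can be bounded from above as
follows:

$$S_1 \le \beta E_n\log\Big(\mathbb{E}\exp\Big(-\frac{Q_2(Z_n,\omega,\mathbb{E}[\omega])}{\beta}\Big)\Big) + E_n[R(Z_n)].$$

Now since $\theta\mapsto -Q_2(z,\theta,\theta')/\beta$ is exponentially concave on $\Theta$ for all
$z\in\mathcal{Z}$, $\theta'\in\Theta$, the Jensen inequality yields

$$\mathbb{E}\exp\Big(-\frac{Q_2(Z_n,\omega,\mathbb{E}[\omega])}{\beta}\Big) \le \exp\Big(-\frac{Q_2(Z_n,\mathbb{E}[\omega],\mathbb{E}[\omega])}{\beta}\Big) = 1.$$

Therefore $S_1 \le E_n[R(Z_n)]$. This and Lemma 4.1 imply the result of the
theorem. □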



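Proof of Theorem 4.2. Using the Jensen inequality twice, with the
concave functions $\log(\cdot)$ and $\Psi_\beta(\cdot,\mathbb{E}[\omega])$, we get

$$\begin{aligned}
S_1 &\le \beta E_{n-1}\log\Big(E\,\mathbb{E}\exp\Big(-\frac{Q_1(Z,\omega,\mathbb{E}[\omega])}{\beta}\Big)\Big) \\
&= \beta E_{n-1}\log\Big(\mathbb{E}\,E\exp\Big(-\frac{Q_1(Z,\omega,\mathbb{E}[\omega])}{\beta}\Big)\Big) \\
&\le \beta E_{n-1}\log\big(\mathbb{E}\,\Psi_\beta(\omega,\mathbb{E}[\omega])\big) \\
&\le \beta E_{n-1}\log\big(\Psi_\beta(\mathbb{E}[\omega],\mathbb{E}[\omega])\big) = 0,
\end{aligned} \tag{4.6}$$

where the first equality is due to the Fubini theorem. Theorem 4.2 follows
now from (4.6) and Lemma 4.1. □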

Remark. A particular case of Theorem 4.1 where $Q_3 \equiv 0$ and the loss
$Q$ is uniformly bounded in $z,\theta$ can be derived from the theory of prediction
of deterministic sequences discussed in Section 3 above. We sketch here
the argument that can be used. If written in our notation, some results of
that theory (see, e.g., [13, 23] or Section 3.3 of [10]) are as follows: under
exponential concavity of $\theta\mapsto -\eta Q(z,\theta)$ for some $\eta > 0$ and boundedness of
$\sup_{z,\theta}|Q(z,\theta)|$, for any fixed sequence $Z_i$ we have

$$\frac{1}{n}\sum_{i=1}^n Q(Z_i,\theta_{i-1}) \le \min_{j=1,\dots,M}\frac{1}{n}\sum_{i=1}^n Q(Z_i,e_j) + \frac{C\log M}{n}, \tag{4.7}$$

where $C$ is a constant depending only on $\beta$ and on the value $\sup_{z,\theta}|Q(z,\theta)|$.
Assuming now that $Z_i$ are random and i.i.d., taking expectations in (4.7)
and interchanging the expectation and the minimum on the right-hand side
we obtain

$$E_n\Big(\frac{1}{n}\sum_{i=1}^n Q(Z_i,\theta_{i-1})\Big) \le \min_{j=1,\dots,M} A(e_j) + \frac{C\log M}{n}. \tag{4.8}$$

Now, exponential concavity of $\theta\mapsto -\eta Q(z,\theta)$ implies convexity of
$\theta\mapsto Q(z,\theta)$ and thus convexity of $A(\cdot)$. Therefore, since $\theta_{i-1}$ is measurable
with respect to $Z_1,\dots,Z_{i-1}$, using Jensen's inequality and the definition of
$\hat\theta_n$ we get

$$E_n\Big(\frac{1}{n}\sum_{i=1}^n Q(Z_i,\theta_{i-1})\Big) = \frac{1}{n}\sum_{i=1}^n E_{i-1}A(\theta_{i-1}) \ge E_{n-1}A(\hat\theta_n). \tag{4.9}$$

Combining (4.8) and (4.9) we get inequality of the form (1.1)-(1.2). We note 
that such an argument can be used as an alternate proof of Corollary 5.3 in 
the next section. However, it does not apply to other examples that we treat 
below using Theorems 4.1 and 4.2 since in those examples either the loss is 
not bounded or the exponential concavity condition is not satisfied. We need 
only some approximate exponential concavity (when using Theorem 4.1) or 
a kind of "exponential concavity in the mean" (when using Theorem 4.2). 

5. Examples. In this section we apply Theorems 4.1 and 4.2 to three 
common statistical problems (regression, classification and density estima- 
tion) in order to establish some new oracle inequalities. In particular, we 
cover the two examples for which our algorithm has been already studied 
in the literature: regression model with squared loss and density estimation 
with KL loss. For the latter case we observe that our general argument easily 
implies the earlier results [7, 9, 29], while for regression with squared loss 
we significantly improve what was known before [5, 8, 9]. 

All the loss functions considered below are twice differentiable. The follow- 
ing proposition gives a simple sufficient condition for exponential concavity. 
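Proposition 5.1. Let $g$ be a twice differentiable function on $\Theta$ with
gradient $\nabla g(\theta)$ and Hessian matrix $\nabla^2 g(\theta)$, $\theta\in\Theta$. If there exists $\beta > 0$
such that for any $\theta\in\Theta$, the matrix

$$\beta\nabla^2 g(\theta) - \nabla g(\theta)(\nabla g(\theta))^\top$$

is positive semidefinite, then $-g(\cdot)/\beta$ is exponentially concave on the simplex
$\Theta$.

Proof. Since $g$ is twice differentiable, $\exp(-g(\cdot)/\beta)$ is also twice
differentiable with Hessian matrix

$$\mathcal{H}(\theta) = \frac{1}{\beta}\exp\Big(-\frac{g(\theta)}{\beta}\Big)\Big[\frac{\nabla g(\theta)(\nabla g(\theta))^\top}{\beta} - \nabla^2 g(\theta)\Big]. \tag{5.1}$$

For any $\lambda\in\mathbb{R}^M$, $\theta\in\Theta$, we have

$$\lambda^\top\mathcal{H}(\theta)\lambda = \frac{1}{\beta}\exp\Big(-\frac{g(\theta)}{\beta}\Big)\Big[\frac{(\lambda^\top\nabla g(\theta))^2}{\beta} - \lambda^\top[\nabla^2 g(\theta)]\lambda\Big] \le 0.$$

Hence $\exp(-g(\cdot)/\beta)$ has a negative semidefinite Hessian and is therefore
concave. □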
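A quick numerical sanity check of Proposition 5.1 can be sketched in Python. As an illustration (chosen here, not taken from the text at this point) we use $g(\theta) = -\log(\theta^\top h)$ for a fixed positive vector $h$, for which $\nabla g = -h/(\theta^\top h)$ and $\nabla^2 g = hh^\top/(\theta^\top h)^2$, so the matrix condition holds exactly when $\beta \ge 1$. The function names and the test vector are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def condition_quadratic_form(theta, h, lam, beta):
    """lam^T [beta * Hess g - grad g grad g^T] lam for g = -log(theta^T h)."""
    s = dot(theta, h)
    grad = [-x / s for x in h]               # gradient of g at theta
    hess_form = (dot(lam, h) / s) ** 2       # lam^T Hess g lam
    grad_form = dot(lam, grad) ** 2          # (lam^T grad g)^2
    return beta * hess_form - grad_form


def midpoint_concave(theta_a, theta_b, h, beta, tol=1e-9):
    """Check f(mid) >= (f(a) + f(b)) / 2 for f = exp(-g / beta)."""
    def f(theta):
        g = -math.log(dot(theta, h))
        return math.exp(-g / beta)
    mid = [(a + b) / 2 for a, b in zip(theta_a, theta_b)]
    return f(mid) >= (f(theta_a) + f(theta_b)) / 2 - tol
```

With $h = (0.2, 1, 3)$ and the two vertices $e_1$, $e_3$, the midpoint test passes for $\beta = 1$ and $\beta = 2$ and fails for $\beta = 0.5$, matching the sign of the quadratic form.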



5.1. Application of Theorem 4.1. We begin with the models that satisfy
assumptions of Theorem 4.1.

1. Regression with squared loss. Let $\mathcal{Z} = \mathcal{X}\times\mathbb{R}$ where $\mathcal{X}$ is a complete
separable metric space equipped with its Borel $\sigma$-algebra. Consider a random
variable $Z = (X,Y)$ with $X\in\mathcal{X}$ and $Y\in\mathbb{R}$. Assume that the conditional
expectation $f(X) = E(Y|X)$ exists and define $\xi = Y - E(Y|X)$, so that

$$Y = f(X) + \xi, \tag{5.2}$$

where $X\in\mathcal{X}$ is a random variable with probability distribution $P_X$, $Y\in\mathbb{R}$,
$f:\mathcal{X}\to\mathbb{R}$ is the regression function and $\xi$ is a real-valued random variable
satisfying $E(\xi|X) = 0$. Assume that $E(Y^2) < \infty$ and $\|f\|_\infty \le L$ for some
finite constant $L > 0$ where $\|\cdot\|_\infty$ denotes the $L_\infty(P_X)$-norm. We have $M$
functions $f_1,\dots,f_M$ such that $\|f_j\|_\infty \le L$, $j = 1,\dots,M$. Define
$\|f\|_{2,P_X}^2 = \int f^2(x)\,P_X(dx)$. Our goal is to construct an aggregate that
mimics the oracle $\min_{1\le j\le M}\|f_j - f\|_{2,P_X}^2$. The aggregate is based on the
i.i.d. sample $(X_1,Y_1),\dots,(X_n,Y_n)$ where $(X_i,Y_i)$ have the same distribution
as $(X,Y)$. For this model, with $z = (x,y)\in\mathcal{X}\times\mathbb{R}$, define the loss function

$$Q(z,\theta) = (y - \theta^\top H(x))^2 \qquad \forall\,\theta\in\Theta,$$

with $H(x) = (f_1(x),\dots,f_M(x))^\top$. It yields for all $z\in\mathcal{Z}$, $\theta,\theta'\in\Theta$,

$$Q_1(z,\theta,\theta') = Q(z,\theta) - Q(z,\theta') = 2y(\theta'-\theta)^\top H(x) + [\theta^\top H(x)]^2 - [\theta'^\top H(x)]^2.$$
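Consider positive constants $b$ and $B$ and assume that $\beta > (b/B)^2$. We now
decompose $Q_1$ into the sum $Q_1 = Q_2 + Q_3$, where

$$Q_2(z,\theta,\theta') = 2y\,\mathbf{1}_{\{|y|\le B\beta\}}(\theta'-\theta)^\top H(x) + [\theta^\top H(x)]^2 - [\theta'^\top H(x)]^2 + \frac{y^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}[(\theta-\theta')^\top H(x)]^2$$

and

$$Q_3(z,\theta,\theta') = 2y\,\mathbf{1}_{\{|y|>B\beta\}}(\theta'-\theta)^\top H(x) - \frac{y^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}[(\theta-\theta')^\top H(x)]^2.$$

We have

$$-Q_3(z,\theta,\theta') \le 4L|y|\,\mathbf{1}_{\{|y|>B\beta\}} + \frac{4L^2y^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}} = R_\beta(y). \tag{5.3}$$

On the other hand, $Q_2(z,\theta,\theta) = 0$, $\forall\,\theta\in\Theta$, $z\in\mathcal{Z}$, and we can prove that the
mapping $\theta\mapsto -Q_2(z,\theta,\theta')/\beta$ is exponentially concave for any $z\in\mathcal{Z}$, $\theta'\in\Theta$
when $b$ and $B$ are properly chosen. For all $\theta\in\Theta$ and $z = (x,y)$ the gradient
and Hessian of $Q_2$ are respectively given by

$$\nabla_\theta Q_2 = \nabla_\theta Q_2(z,\theta,\theta') = -2\big(y\,\mathbf{1}_{\{|y|\le B\beta\}} - \theta^\top H(x)\big)H(x) + \frac{2y^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}\big[(\theta-\theta')^\top H(x)\big]H(x)$$

and

$$\nabla^2_{\theta\theta}Q_2 = \nabla^2_{\theta\theta}Q_2(z,\theta,\theta') = 2H(x)H(x)^\top + \frac{2y^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}H(x)H(x)^\top.$$

We now prove that Proposition 5.1 applies for $g(\theta) = Q_2(z,\theta,\theta')$, for all
$z = (x,y)\in\mathcal{Z}$ and $\theta'\in\Theta$. For any $\lambda\in\mathbb{R}^M$, any $\theta,\theta'\in\Theta$ and any $z\in\mathcal{Z}$,

$$(\lambda^\top\nabla_\theta Q_2)^2 \le \Big(2|y|\,\mathbf{1}_{\{|y|\le B\beta\}} + 2L + \frac{4Ly^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}\Big)^2[\lambda^\top H(x)]^2.$$

Note now that $|y|\le B\beta$ implies that $y^2/(B\beta)\le|y|$. Hence

$$\begin{aligned}
(\lambda^\top\nabla_\theta Q_2)^2 &\le \Big(2|y|\,\mathbf{1}_{\{|y|\le b\sqrt{\beta}\}} + 2L + (4L+2)|y|\,\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}\Big)^2[\lambda^\top H(x)]^2 \\
&\le \Big(8b^2\beta + 8L^2 + 2(4L+2)^2y^2\,\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}\Big)[\lambda^\top H(x)]^2.
\end{aligned}$$

Therefore

$$\frac{(\lambda^\top\nabla_\theta Q_2)^2}{\beta} - \lambda^\top[\nabla^2_{\theta\theta}Q_2]\lambda \le \Big(8b^2 + \frac{8L^2}{\beta} - 2 + \Big[2(4L+2)^2 - \frac{2}{B}\Big]\frac{y^2}{\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}\Big)[\lambda^\top H(x)]^2.$$

If we choose $B \le (4L+2)^{-2}$ and $LB \le b \le 1/4$, the above quadratic form is
smaller than or equal to 0 and Proposition 5.1 applies for any $\beta > (b/B)^2$.
Now, since $A(\theta) = EQ(Z,\theta) = E(Y - \theta^\top H(X))^2 = \|f - \theta^\top H\|_{2,P_X}^2 + E(\xi^2)$
for all $\theta\in\Theta$, we obtain the following corollary of Theorem 4.1.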

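Corollary 5.1. Consider the regression model (5.2) where $X\in\mathcal{X}$,
$Y\in\mathbb{R}$, $f:\mathcal{X}\to\mathbb{R}$ and $\xi = Y - f(X)$ is a real-valued random variable
satisfying $E(\xi|X) = 0$. Assume also that $E(Y^2) < \infty$ and $\|f_j\|_\infty \le L$,
$j = 1,\dots,M$, for some finite constant $L > 0$. Then for any positive constants
$B \le (4L+2)^{-2}$, $LB \le b \le 1/4$ and any $\beta > (b/B)^2$, the aggregate estimator
$\tilde f_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained by the mirror averaging
algorithm, satisfies

$$E_{n-1}\|\tilde f_n - f\|_{2,P_X}^2 \le \min_{1\le j\le M}\|f_j - f\|_{2,P_X}^2 + \frac{\beta\log M}{n} + E[R_\beta(Y)], \tag{5.4}$$

where

$$R_\beta(y) = 4L|y|\,\mathbf{1}_{\{|y|>B\beta\}} + \frac{4L^2y^2}{B\beta}\mathbf{1}_{\{b\sqrt{\beta}<|y|\le B\beta\}}.$$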

This result improves an inequality obtained by [5]: it yields a better rate
under the same moment conditions. We note that the aggregate $\tilde f_n$ as in
Corollary 5.1 is of the form suggested by Catoni [8, 9]. If there exists a
constant $L_0 > 0$ such that $|Y| \le L_0$ a.s., the last summand disappears for
$\beta > 16L_0^2$, and in this case (5.4) can be also deduced from [8, 9], though in
a coarser form and under a more restrictive assumption on $\beta$.

An advantage of Corollary 5.1 is that no heavy assumption on the moments
of $\xi$ is needed to get reasonable bounds. Thus, the second moment
assumption on $Y$ is enough for a bound with the $\sqrt{(\log M)/n}$ rate. Indeed,
choosing $\beta \sim (n/\log M)^{2/(2+s)}$, $s > 0$, in Corollary 5.1, we immediately get the
following result.

Corollary 5.2. Consider the regression model (5.2) where $X\in\mathcal{X}$,
$Y\in\mathbb{R}$, $f:\mathcal{X}\to\mathbb{R}$ and $\xi = Y - f(X)$ is a real-valued random variable
satisfying $E(\xi|X) = 0$. Assume also that $E(|Y|^s) \le m_s < \infty$ for some $s \ge 2$ and
$\|f_j\|_\infty \le L$, $j = 1,\dots,M$, for some finite constant $L > 0$. Then there exist
constants $C_1 > 0$ and $C_2 = C_2(m_s,L,C_1) > 0$ such that the aggregate
estimator $\tilde f_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained by the mirror averaging
algorithm with $\beta = C_1(n/\log M)^{2/(2+s)}$, satisfies

$$E_{n-1}\|\tilde f_n - f\|_{2,P_X}^2 \le \min_{1\le j\le M}\|f_j - f\|_{2,P_X}^2 + C_2\Big(\frac{\log M}{n}\Big)^{s/(2+s)}. \tag{5.5}$$
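The choice of $\beta$ in Corollary 5.2 can be motivated by a short balancing computation. The sketch below assumes the moment bound $E|Y|^s \le m_s$ with $s \ge 2$, applies Markov-type inequalities to the two terms of $R_\beta$ in (5.4), and does not track constants:

$$\frac{\beta\log M}{n}\bigg|_{\beta = (n/\log M)^{2/(2+s)}} = \Big(\frac{\log M}{n}\Big)^{s/(2+s)},$$

$$E\big[|Y|\,\mathbf{1}_{\{|Y|>B\beta\}}\big] \le \frac{m_s}{(B\beta)^{s-1}}, \qquad E\Big[\frac{Y^2}{B\beta}\mathbf{1}_{\{|Y|>b\sqrt{\beta}\}}\Big] \le \frac{m_s}{B\,b^{s-2}\beta^{s/2}}.$$

With this $\beta$, the term $\beta^{-s/2}$ equals $((\log M)/n)^{s/(2+s)}$, and $\beta^{-(s-1)}$ is of no larger order when $s \ge 2$ and $(\log M)/n \le 1$, so all three contributions to (5.4) are balanced at the rate appearing in (5.5).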
2. Classification. Consider the problem of binary classification. Let $(\mathcal{X},\mathcal{F})$
be a measurable space, and set $\mathcal{Z} = \mathcal{X}\times\{-1,1\}$. Consider $Z = (X,Y)$ where
$X$ is a random variable with values in $\mathcal{X}$ and $Y$ is a random label with
values in $\{-1,1\}$. For a fixed convex twice differentiable function $\varphi:\mathbb{R}\to\mathbb{R}_+$,
define the $\varphi$-risk of a real-valued classifier $h:\mathcal{X}\to[-1,1]$ as $E\varphi(-Yh(X))$.
In our framework, we have $M$ such classifiers $h_1,\dots,h_M$ and the goal is
to mimic the oracle $\min_{1\le j\le M}E\varphi(-Yh_j(X))$ based on the i.i.d. sample
$(X_1,Y_1),\dots,(X_n,Y_n)$ where $(X_i,Y_i)$ have the same distribution as $(X,Y)$.
For any $z = (x,y)\in\mathcal{X}\times\{-1,1\}$, we define the loss function

$$Q(z,\theta) = \varphi(-y\,\theta^\top H(x)) \ge 0, \qquad \theta\in\Theta,$$

where $H(x) = (h_1(x),\dots,h_M(x))^\top$. For such a function and for all $\theta\in\Theta$,
$z = (x,y)\in\mathcal{X}\times\{-1,1\}$ we have

$$\nabla_\theta Q_1(z,\theta,\theta') = -y\,\varphi'(-y\,\theta^\top H(x))H(x),$$

$$\nabla^2_{\theta\theta}Q_1(z,\theta,\theta') = \varphi''(-y\,\theta^\top H(x))H(x)H(x)^\top.$$

Thus, from Proposition 5.1 the mapping $\theta\mapsto -Q_1(z,\theta,\theta')/\beta$ is exponentially
concave for all $z$ and $\theta'$ if $\beta \ge \beta_\varphi$, where $\beta_\varphi$ is such that
$[\varphi'(x)]^2 \le \beta_\varphi\varphi''(x)$, $\forall\,|x|\le 1$. Now, since

$$A(\theta) = EQ(Z,\theta) \quad\text{and}\quad Q(Z,\theta) = \varphi(-Y\theta^\top H(X)), \qquad \forall\,\theta\in\Theta,\ Z = (X,Y),$$

we obtain the following corollary of Theorem 4.1 applied with $Q_2 = Q_1$ and
$Q_3 \equiv 0$.

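Corollary 5.3. Consider the binary classification problem as described
above. Assume that the convex function $\varphi$ is such that

$$[\varphi'(x)]^2 \le \beta_\varphi\varphi''(x) \qquad \forall\,|x|\le 1.$$

Then the aggregate classifier $\tilde h_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained
by the mirror averaging algorithm with $\beta \ge \beta_\varphi$, satisfies

$$E_n\varphi(-Y_n\tilde h_n(X_n)) \le \min_{1\le j\le M}E\varphi(-Yh_j(X)) + \frac{\beta\log M}{n}. \tag{5.6}$$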

For example, inequality (5.6) holds with the exponential Boosting loss
$\varphi_1(x) = e^x$, for which $\beta_{\varphi_1} = e$, and for the Logit-Boosting loss
$\varphi_2(x) = \log_2(1 + e^x)$ (in that case $\beta_{\varphi_2} = e\log 2$). For the squared loss
$\varphi_3(x) = (1-x)^2$ and the 2-norm soft margin loss $\varphi_4(x) = \max\{0, 1-x\}^2$,
inequality (5.6) is satisfied with $\beta \ge 2$.
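The constant $\beta_\varphi$ of Corollary 5.3 can be approximated numerically for a given loss. The Python sketch below is illustrative only: the function name `beta_phi` and the grid search are assumptions made here, the caller must supply the first and second derivatives of $\varphi$, and the second derivative is assumed strictly positive on $[-1,1]$. It is demonstrated on the exponential loss $\varphi_1(x) = e^x$, whose derivatives are again $e^x$.

```python
import math


def beta_phi(dphi, d2phi, grid=2001):
    """Approximate sup over |x| <= 1 of dphi(x)**2 / d2phi(x) on a grid.

    dphi and d2phi are the first and second derivatives of the loss;
    d2phi is assumed strictly positive on [-1, 1].
    """
    best = 0.0
    for k in range(grid):
        x = -1.0 + 2.0 * k / (grid - 1)   # grid includes both endpoints
        best = max(best, dphi(x) ** 2 / d2phi(x))
    return best


# Exponential Boosting loss: phi(x) = exp(x), so phi' = phi'' = exp.
beta_exp = beta_phi(math.exp, math.exp)
```

For the exponential loss the ratio equals $e^x$, which is maximized at $x = 1$, so `beta_exp` agrees with the value $e$ quoted above up to rounding.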

3. Nonparametric density estimation with Kullback-Leibler (KL) loss. Let
$X$ be a random variable with values in a measurable space $(\mathcal{X},\mathcal{F})$. Assume
that the distribution of $X$ admits a density $p$ with respect to a $\sigma$-finite
measure $\mu$ on $(\mathcal{X},\mathcal{F})$. Assume also that we have $M$ probability densities
$p_j$ with respect to $\mu$ on $(\mathcal{X},\mathcal{F})$ (estimators of $p$) and an i.i.d. sample
$X_1,\dots,X_n$ where $X_i$ take values in $\mathcal{X}$, and have the same distribution as
$X$. Define the KL divergence between two probability densities $p$ and $q$ with
respect to $\mu$ as

$$\mathcal{K}(p,q) = \int\log\Big(\frac{p(x)}{q(x)}\Big)p(x)\,\mu(dx)$$

if the probability distribution corresponding to $p$ is absolutely continuous
with respect to the one corresponding to $q$, and $\mathcal{K}(p,q) = \infty$ otherwise. We
assume that the entropy integral $\int p(x)\log p(x)\,\mu(dx)$ is finite.

Our goal is to construct an aggregate that mimics the KL oracle defined
by $\min_{1\le j\le M}\mathcal{K}(p,p_j)$. For $x\in\mathcal{X}$, $\theta\in\Theta$, we introduce the corresponding loss
function

$$Q(x,\theta) = -\log(\theta^\top H(x)), \tag{5.7}$$

where $H(x) = (p_1(x),\dots,p_M(x))^\top$. We set $Z = X$. Then

$$A(\theta) = EQ(X,\theta) = -\int\log(\theta^\top H(x))\,p(x)\,\mu(dx),$$

where the integral is finite if all the divergences $\mathcal{K}(p,p_j)$ are finite. In
particular, $A(e_j) = \mathcal{K}(p,p_j) - \int p(x)\log p(x)\,\mu(dx)$. Since, for all $x\in\mathcal{X}$, we have

$$\exp(-Q_1(x,\theta,\theta')/\beta) = (\theta^\top H(x))^{1/\beta}(\theta'^\top H(x))^{-1/\beta},$$

the mapping $\theta\mapsto -Q_1(x,\theta,\theta')/\beta$ is exponentially concave on $\Theta$ for any
$\beta \ge 1$. Hence, we can apply Theorem 4.1, again with $Q_2 = Q_1$ and $Q_3 \equiv 0$, and
we obtain the following corollary.

Corollary 5.4. Consider the density estimation problem with the KL
loss as described above, such that $\int p(x)|\log p(x)|\,\mu(dx) < \infty$. Then the
aggregate estimator $\tilde p_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained by the mirror
averaging algorithm with $\beta = 1$, satisfies

$$E_{n-1}\mathcal{K}(p,\tilde p_n) \le \min_{1\le j\le M}\mathcal{K}(p,p_j) + \frac{\log M}{n}.$$

We note that the KL aggregate $\tilde p_n$ as in Corollary 5.4 coincides with the
"progressive mixture rule" considered by Catoni [7, 8, 9] and Yang [29], and
the oracle inequality of Corollary 5.4 is the one obtained in those papers.
We also note that this is the most trivial example of application of our results.
In fact, when $Q$ is of the particular form (5.7), the convexity argument
that we developed in Theorems 4.1 and 4.2 is not needed since $S_1 = 0$, so
that Corollary 5.4 follows directly from Lemma 4.1. Writing the proof of
Lemma 4.1 for this particular $Q$ we essentially recover the proof of Theorem 3.1.1
in [9]. Extension of Corollary 5.4 to $\beta > 1$ is straightforward, but
the oracle inequality for the corresponding aggregate ("Gibbs estimator";
cf. [9]) is less interesting because its remainder term is obviously larger.

5.2. Applications of Theorem 4.2. We now apply Theorem 4.2 to obtain
bounds for the regression setup that are sharper than the existing ones. 
We also use this result to handle the problems of density estimation with 
squared loss and some examples of parametric estimation that cannot be 
treated using Theorem 4.1. 

4. Regression with squared loss and finite exponential moment. We consider
here the regression model described in Corollary 5.1 under the additional
assumption that, conditionally on $X$, the regression residual $\xi$ admits
an exponential moment, that is, there exist positive constants $b$ and $D$ such
that, $P_X$-a.s.,

$$E(\exp(b|\xi|)\mid X) \le D.$$

Since $E(\xi\mid X) = 0$, this assumption is equivalent to the existence of positive
constants $b_0$ and $\sigma^2$ such that, $P_X$-a.s.,

$$E(\exp(t\xi)\mid X) \le \exp(\sigma^2t^2/2) \quad \forall\,|t|\le b_0; \tag{5.8}$$

cf. [21], page 56.

In this case, application of Corollary 5.1 leads to suboptimal rates because 
of the term i?[i?^(y)] in (5.4). We show now that, using Theorem 4.2, we 
can obtain an oracle inequality with optimal rate (logM)/n. 

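To apply Theorem 4.2, we analyze the mapping $\theta\mapsto E\exp(-Q_1(Z,\theta,\theta')/\beta)$.
For the regression model with squared loss as described above, we have
$Z = (X,Y)$, $Q(Z,\theta) = (Y - \theta^\top H(X))^2$, and

$$\begin{aligned}
E\exp(-Q_1(Z,\theta,\theta')/\beta)
 &= E\exp\Bigl(-\frac{1}{\beta}\bigl[(Y - H(X)^\top\theta)^2 - (Y - H(X)^\top\theta')^2\bigr]\Bigr)\\
 &= E\exp\Bigl(-\frac{1}{\beta}\bigl[2\xi(U(X,\theta) - U(X,\theta')) + U^2(X,\theta) - U^2(X,\theta')\bigr]\Bigr),
\end{aligned}$$

where $U(X,\theta) = f(X) - H(X)^\top\theta$. Since

$$|2(U(X,\theta) - U(X,\theta'))| = 2|(\theta-\theta')^\top H(X)| \le 4L,$$

conditioning on $X$ and using (5.8) we get that, for any $\beta\ge 4L/b_0$,

$$E\exp(-Q_1(Z,\theta,\theta')/\beta) \le \Psi_\beta(\theta,\theta'),$$

where

$$\Psi_\beta(\theta,\theta') \triangleq E\exp\Bigl(\frac{2\sigma^2}{\beta^2}\bigl[(\theta-\theta')^\top H(X)\bigr]^2 - \frac{1}{\beta}\bigl[U^2(X,\theta) - U^2(X,\theta')\bigr]\Bigr).$$

Clearly, $\Psi_\beta(\theta,\theta) = 1$. Thus, to apply Theorem 4.2 it suffices now to specify
$\beta_0 > 0$ such that the mapping

$$\theta\mapsto \bar Q(x,\theta,\theta') \triangleq \frac{2\sigma^2}{\beta^2}\bigl[(\theta-\theta')^\top H(x)\bigr]^2 - \frac{1}{\beta}\bigl[U^2(x,\theta) - U^2(x,\theta')\bigr]$$

is exponentially concave for all $\beta\ge\beta_0$, $\theta'\in\Theta$ and almost all $x\in\mathcal{X}$. Note
that

$$\nabla_\theta\bar Q(x,\theta,\theta') = \Bigl(2\gamma\bigl(f(x) - H(x)^\top\theta\bigr) + \frac{4\sigma^2}{\beta^2}\bigl(f(x) - H(x)^\top\theta'\bigr)\Bigr)H(x),$$

$$\nabla^2_{\theta\theta}\bar Q(x,\theta,\theta') = -2\gamma H(x)H(x)^\top,$$

where $\gamma = \frac{1}{\beta} - \frac{2\sigma^2}{\beta^2}$. Proposition 5.1 implies that $\bar Q$ is exponentially concave
in $\theta$ if $\nabla^2_{\theta\theta}\bar Q(x,\theta,\theta') + \nabla_\theta\bar Q(x,\theta,\theta')(\nabla_\theta\bar Q(x,\theta,\theta'))^\top \le 0$. If we assume that
$\max_j\|f - f_j\|_\infty \le \bar L$, we obtain that the latter property holds for $\beta\ge\beta_0 =
2\sigma^2 + 2\bar L^2$. Thus, Theorem 4.2 applies for $\beta\ge\max(2\sigma^2 + 2\bar L^2,\,4L/b_0)$ and
we have proved the following result.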
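
The threshold $2\sigma^2 + 2\bar L^2$ can be verified in three lines. The derivation below is a sketch under the notation of this example: $\bar Q$ denotes the exponent defining $\Psi_\beta$, $U = f(x) - H(x)^\top\theta$ and $U' = f(x) - H(x)^\top\theta'$ (both bounded by $\bar L$ in absolute value because $\theta,\theta'$ lie in the simplex), and the shorthand $c$ is introduced here only for brevity.

```latex
\begin{align*}
\nabla^2_{\theta\theta}\bar Q + \nabla_\theta\bar Q\,(\nabla_\theta\bar Q)^\top
  &= \bigl[(2\gamma U + cU')^2 - 2\gamma\bigr]\,H(x)H(x)^\top,
  \qquad \gamma = \frac{1}{\beta} - \frac{2\sigma^2}{\beta^2},\quad c = \frac{4\sigma^2}{\beta^2},\\
|2\gamma U + cU'| &\le (2\gamma + c)\,\bar L = \frac{2\bar L}{\beta}
  \qquad\text{when } |U|,|U'|\le\bar L \text{ and } \gamma\ge 0,\\
\frac{4\bar L^2}{\beta^2} \le 2\gamma = \frac{2}{\beta} - \frac{4\sigma^2}{\beta^2}
  &\iff \beta \ge 2\sigma^2 + 2\bar L^2 .
\end{align*}
```

Since $H(x)H(x)^\top$ is positive semidefinite, a nonpositive bracket is enough for the matrix inequality, and $\gamma\ge 0$ holds automatically once $\beta\ge 2\sigma^2$.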

Corollary 5.5. Consider the regression model (5.2) where $X\in\mathcal{X}$,
$Y\in\mathbb{R}$, $f:\mathcal{X}\to\mathbb{R}$ and the random variable $\xi = Y - f(X)$ is such that there
exist positive constants $b_0$ and $\sigma^2$ for which (5.8) holds $P_X$-a.s. Assume also
that $\|f - f_j\|_\infty \le \bar L$ and $\|f_j\|_\infty \le L$, $j = 1,\dots,M$, for some finite positive
constants $L,\bar L$. Then for any $\beta\ge\max(2\sigma^2 + 2\bar L^2,\,4L/b_0)$ the aggregate estimator
$\tilde f_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained by the mirror averaging
algorithm, satisfies

$$E_{n-1}\|\tilde f_n - f\|^2_{2,P_X} \le \min_{1\le j\le M}\|f_j - f\|^2_{2,P_X} + \frac{\beta\log M}{n}. \tag{5.9}$$

To see how good the constants are, we may compare this corollary with
the results obtained in other papers for the particular case where $\xi$ is
conditionally Gaussian given $X$. In this case we have $b_0 = \infty$ and Corollary 5.5
yields the following result.

Corollary 5.6. Consider the regression model (5.2) where $X\in\mathcal{X}$,
$Y\in\mathbb{R}$, $f:\mathcal{X}\to\mathbb{R}$ and, conditionally on $X$, the random variable $\xi = Y - f(X)$
is Gaussian with zero mean and variance bounded by $\sigma^2$. Assume that
$\|f - f_j\|_\infty \le \bar L$, for some finite constant $\bar L > 0$. Then for any $\beta\ge 2\sigma^2 + 2\bar L^2$ the
aggregate estimator $\tilde f_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained by the
mirror averaging algorithm, satisfies (5.9).

This result for the Gaussian regression model is more general than that of
[9], page 89, because we do not assume that $f$ and all $f_j$, $j = 1,\dots,M$, are
uniformly bounded. Even if we assume in addition that $f$ and all $f_j$,
$j = 1,\dots,M$, are uniformly bounded by $L$, Corollary 5.6 improves the result of
[9], page 89. Indeed, in this case we have $\bar L\le 2L$ and a sufficient condition
on $\beta$ in Corollary 5.6 is $\beta\ge 2\sigma^2 + 8L^2$. In [9], page 89, we find the result of
Corollary 5.6, though under the much more restrictive condition
$\beta\ge 18.01\sigma^2 + 70.4L^2$.
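
To make the comparison of constants concrete, the snippet below evaluates both sufficient conditions on $\beta$. It is only an arithmetic illustration: the function names and the values of `sigma2` and `L` are hypothetical, and the second threshold is simply the one quoted above from [9], page 89.

```python
def beta_mirror_averaging(sigma2, L):
    """Sufficient threshold of Corollary 5.6 when |f| and |f_j| are bounded by L."""
    return 2.0 * sigma2 + 8.0 * L ** 2

def beta_reference_9(sigma2, L):
    """Threshold quoted in the text from [9], page 89."""
    return 18.01 * sigma2 + 70.4 * L ** 2

sigma2, L = 1.0, 1.0  # hypothetical noise variance and sup-norm bound
ratio = beta_reference_9(sigma2, L) / beta_mirror_averaging(sigma2, L)
print(beta_mirror_averaging(sigma2, L), beta_reference_9(sigma2, L), ratio)
```

Since the remainder term in (5.9) is proportional to $\beta$, the ratio of the two thresholds is also the ratio of the guaranteed remainder terms; it always lies between $70.4/8 = 8.8$ and $18.01/2 = 9.005$.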

5. Nonparametric density estimation with the $L_2$ loss. Let $\mu$ be a $\sigma$-finite
measure on the measurable space $(\mathcal{X},\mathcal{F})$. In this whole example, densities are
understood with respect to $\mu$ and $\|\cdot\|_\infty$ denotes the $L_\infty(\mu)$-norm. Assume
that we have $M$ probability densities $p_j$, $\|p_j\|_\infty\le L$, $j = 1,\dots,M$, and an
i.i.d. sample $X_1,\dots,X_n$, where the $X_i$ take values in $\mathcal{X}$ and are distributed as a
random variable $X$ with unknown probability density $p$ such that $\|p\|_\infty\le L$
for some positive constant $L$. Our goal is to mimic the oracle defined by
$\min_{1\le j\le M}\|p_j - p\|_2^2$, where $\|p\|_2^2 = \int p^2(x)\,\mu(dx)$.

The corresponding loss function is defined, for any $x\in\mathcal{X}$, $\theta\in\Theta$, by

$$Q(x,\theta) = \theta^\top G\theta - 2\theta^\top H(x), \tag{5.10}$$

where $H(x) = (p_1(x),\dots,p_M(x))^\top$ and $G$ is an $M\times M$ positive semidefinite
matrix with elements $G_{jk} = \int p_jp_k\,d\mu \le L$. We set $Z = X$. Then $A(\theta) =
EQ(X,\theta) = \|p - \theta^\top H\|_2^2 - \|p\|_2^2$. We now want to check conditions of Theorem 4.2,
that is, to show that for the loss function (5.10), the mapping
$\theta\mapsto E\exp(-Q_1(X,\theta,\theta')/\beta)$ is concave on $\Theta$, for any $\theta'\in\Theta$ and for $\beta\ge\beta_0$
with some $\beta_0 > 0$ that will be specified below. Note first that

$$Q_1(x,\theta,\theta') = Q(x,\theta) - Q(x,\theta') = (\theta-\theta')^\top G(\theta+\theta') - 2(\theta-\theta')^\top H(x). \tag{5.11}$$

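Fix $\theta'\in\Theta$. Concavity of the above mapping can be checked by considering
its Hessian $\mathcal{H}$ which, in view of (5.1), satisfies for any $\lambda\in\mathbb{R}^M$, $\theta\in\Theta$,

$$\lambda^\top\mathcal{H}(\theta)\lambda = \frac{1}{\beta^2}E\Bigl\{\exp\Bigl(-\frac{Q_1(X,\theta,\theta')}{\beta}\Bigr)\bigl[(\lambda^\top\nabla_\theta Q_1(X,\theta,\theta'))^2 - \beta\lambda^\top\nabla^2_{\theta\theta}Q_1(X,\theta,\theta')\lambda\bigr]\Bigr\}.$$

Note that for any $x\in\mathcal{X}$, $\theta\in\Theta$ we have

$$\nabla_\theta Q_1(x,\theta,\theta') = 2G\theta - 2H(x) \quad\text{and}\quad \nabla^2_{\theta\theta}Q_1(x,\theta,\theta') = 2G.$$

By (5.11) this yields, for any $\lambda\in\mathbb{R}^M$, $\theta,\theta'\in\Theta$,

$$\begin{aligned}
\lambda^\top\mathcal{H}(\theta)\lambda
 &= -\frac{2}{\beta^2}E\Bigl\{\exp\Bigl(-\frac{(\theta-\theta')^\top G(\theta+\theta') - 2(\theta-\theta')^\top H(X)}{\beta}\Bigr)\\
 &\qquad\qquad\times\bigl[\beta\lambda^\top G\lambda - 2(\lambda^\top(G\theta - H(X)))^2\bigr]\Bigr\}\\
 &\le -\frac{2}{\beta^2}\exp\Bigl(-\frac{(\theta-\theta')^\top G(\theta+\theta')}{\beta}\Bigr)F(\lambda,\theta,\theta'),
\end{aligned} \tag{5.12}$$

where

$$F(\lambda,\theta,\theta') \triangleq E\Bigl\{\exp\Bigl(\frac{2(\theta-\theta')^\top H(X)}{\beta}\Bigr)\bigl[\beta\lambda^\top G\lambda - 4(\lambda^\top G\theta)^2 - 4(\lambda^\top H(X))^2\bigr]\Bigr\}. \tag{5.13}$$

The inequality in (5.12) uses $(u-v)^2\le 2u^2 + 2v^2$. Observe that by the Cauchy inequality

$$(\lambda^\top G\theta)^2 \le \lambda^\top G\lambda\;\theta^\top G\theta \le L\lambda^\top G\lambda \quad \forall\,\theta\in\Theta. \tag{5.14}$$

Further,

$$E(\lambda^\top H(X))^2 = \int(\lambda^\top H(x))^2p(x)\,\mu(dx) \le L\int(\lambda^\top H(x))^2\,\mu(dx) = L\lambda^\top G\lambda. \tag{5.15}$$

Using (5.14) and (5.15) and the fact that $\|\theta-\theta'\|_1\le 2$, where $\|\cdot\|_1$ stands
for the $\ell_1(\mathbb{R}^M)$-norm, we obtain

$$\begin{aligned}
F(\lambda,\theta,\theta') &\ge (\beta-4L)\lambda^\top G\lambda\,E\exp\Bigl(\frac{2(\theta-\theta')^\top H(X)}{\beta}\Bigr)
 - 4E\Bigl\{\exp\Bigl(\frac{2(\theta-\theta')^\top H(X)}{\beta}\Bigr)(\lambda^\top H(X))^2\Bigr\}\\
 &\ge (\beta-4L)\lambda^\top G\lambda\exp\Bigl(-\frac{4L}{\beta}\Bigr) - 4L\lambda^\top G\lambda\exp\Bigl(\frac{4L}{\beta}\Bigr) \ge 0,
\end{aligned}$$

provided that

$$\frac{\beta-4L}{4L}\exp\Bigl(-\frac{8L}{\beta}\Bigr) \ge 1.$$

Note that the last inequality is guaranteed for $\beta\ge\beta_0 = 12L$. We conclude
that for $\beta\ge 12L$ the Hessian $\mathcal{H}$ in (5.12) is negative semidefinite and therefore
the mapping $\theta\mapsto E\exp(-Q_1(X,\theta,\theta')/\beta)$ is concave on $\Theta$ for any fixed
$\theta'\in\Theta$. Thus we have proved the following corollary of Theorem 4.2.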
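
The choice $\beta_0 = 12L$ is easy to confirm numerically. The snippet below is a small check, not part of the proof: it evaluates the left-hand side of the condition $\frac{\beta-4L}{4L}\exp(-8L/\beta)\ge 1$, which depends on $\beta$ and $L$ only through $\beta/L$. The function name `concavity_margin` and the value of `L` are hypothetical.

```python
import math

def concavity_margin(beta, L):
    """Left-hand side of ((beta - 4L) / (4L)) * exp(-8L / beta) >= 1."""
    return (beta - 4.0 * L) / (4.0 * L) * math.exp(-8.0 * L / beta)

L = 0.5  # hypothetical sup-norm bound on the densities
margin_at_threshold = concavity_margin(12.0 * L, L)  # equals 2 * exp(-2/3)
margin_below = concavity_margin(11.0 * L, L)         # equals 1.75 * exp(-8/11)
```

At $\beta = 12L$ the margin equals $2e^{-2/3} > 1$, while at $\beta = 11L$ it is already below 1. Both factors of the margin increase with $\beta$, so the condition holds for every $\beta\ge 12L$.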



Corollary 5.7. Consider the density estimation problem with the $L_2$
loss as described above. Then, for any $\beta\ge 12L$, the aggregate estimator
$\tilde p_n(x) = \hat\theta_n^\top H(x)$, $x\in\mathcal{X}$, where $\hat\theta_n$ is obtained by the mirror averaging
algorithm, satisfies

$$E_{n-1}\|\tilde p_n - p\|_2^2 \le \min_{1\le j\le M}\|p_j - p\|_2^2 + \frac{\beta\log M}{n}.$$
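
For the $L_2$ loss the vertex losses are $Q(x,e_j) = G_{jj} - 2p_j(x)$, so the aggregate only needs the diagonal of $G$ and the candidate densities evaluated at the sample. The sketch below illustrates this for Gaussian candidates, for which $G_{jj}$ has a closed form. It is a hypothetical example, not the authors' code, and it assumes that mirror averaging reduces to averaging exponential-weight vectors built from cumulative vertex losses; all function names and numerical settings are illustrative.

```python
import numpy as np

def gaussian_density(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def l2_vertex_losses(sample, means, std):
    """Q(x, e_j) = G_jj - 2 p_j(x) for the candidates p_j = N(means[j], std^2)."""
    # For a Gaussian density, G_jj = integral of p_j^2 = 1 / (2 * std * sqrt(pi)).
    g_diag = 1.0 / (2.0 * std * np.sqrt(np.pi))
    return g_diag - 2.0 * gaussian_density(sample[:, None], means[None, :], std)

def averaged_exponential_weights(vertex_losses, beta):
    """Mean of theta_0, ..., theta_{n-1}, theta_i proportional to exp(-cumulative loss / beta)."""
    losses = np.asarray(vertex_losses, dtype=float)
    cumulative = np.vstack([np.zeros((1, losses.shape[1])), np.cumsum(losses, axis=0)])
    logits = -cumulative / beta
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    thetas = np.exp(logits)
    thetas /= thetas.sum(axis=1, keepdims=True)
    return thetas.mean(axis=0)

# Hypothetical setting: data from N(0, 1), candidates N(0, 1) and N(3, 1).
std = 1.0
means = np.array([0.0, 3.0])
L = 1.0 / (std * np.sqrt(2.0 * np.pi))  # sup-norm of every density involved
beta = 12.0 * L                          # smallest value allowed by Corollary 5.7

rng = np.random.default_rng(1)
sample = rng.normal(0.0, std, size=300)
losses = l2_vertex_losses(sample[:-1], means, std)
theta_hat = averaged_exponential_weights(losses, beta)
```

The aggregate density is then $\hat\theta_n^\top H(x)$ with `theta_hat` in place of $\hat\theta_n$; note that only $n-1$ observations enter the weights.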

6. Parametric estimation with Kullback-Leibler (KL) loss. Let $\mathcal{P} = \{P_a, a\in A\}$
be a family of probability measures on a measurable space $(\mathcal{X},\mathcal{F})$ dominated
by a $\sigma$-finite measure $\mu$ on $(\mathcal{X},\mathcal{F})$. Here $A\subset\mathbb{R}^m$ is a bounded
set of parameters. The densities relative to $\mu$ are denoted by $p(x,a) =
(dP_a/d\mu)(x)$, $x\in\mathcal{X}$. Let $X$ be a random variable with values in $\mathcal{X}$ distributed
according to $P_{a^*}$ where $a^*\in A$ is the unknown true value of the parameter.

In the aggregation framework, we have $M$ values $a_1,\dots,a_M\in A$ (preliminary
estimators of $a^*$) and an i.i.d. sample $X_1,\dots,X_n$, where the $X_i$ take
values in $\mathcal{X}$ and have the same distribution as $X$. Our goal is to construct
an aggregate $\tilde a_n$ that mimics the parametric KL oracle defined by
$\min_{1\le j\le M}K(a^*,a_j)$, where

$$K(a,b) \triangleq \mathcal{K}(p(\cdot,a),p(\cdot,b)) \quad \forall\,a,b\in A.$$

For $x\in\mathcal{X}$, $\theta\in\Theta$, we introduce the corresponding loss function

$$Q(x,\theta) = -\log p(x,\theta^\top H),$$

where $H = (a_1,\dots,a_M)^\top$. We set $Z = X$. Then

$$A(\theta) = EQ(X,\theta) = -\int\log(p(x,\theta^\top H))\,p(x,a^*)\,\mu(dx),$$
$$A(e_j) = K(a^*,a_j) - \int p(x,a^*)\log(p(x,a^*))\,\mu(dx).$$



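Since, for all $x\in\mathcal{X}$, $\exp(-Q(x,\theta)/\beta) = (p(x,\theta^\top H))^{1/\beta}$, to apply Theorem 4.2
we need the following assumption.

Assumption 5.1. For some $\beta > 0$ and for any $a\in A$ there exists a Borel
function $\Psi_\beta:\Theta\times\Theta\to\mathbb{R}_+$ such that $\theta\mapsto\Psi_\beta(\theta,\theta')$ is concave on the simplex
$\Theta$ for all $\theta'\in\Theta$, $\Psi_\beta(\theta,\theta) = 1$ and

$$\int\Bigl(\frac{p(x,\theta^\top H)}{p(x,\theta'^\top H)}\Bigr)^{1/\beta}p(x,a)\,\mu(dx) \le \Psi_\beta(\theta,\theta')$$

for all $\theta,\theta'\in\Theta$.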



Corollary 5.8. Consider the parametric estimation problem with the
KL loss as described above and let $\int p(x,a^*)|\log p(x,a^*)|\,\mu(dx) < \infty$. Suppose
that Assumption 5.1 is fulfilled for some $\beta > 0$. Then the aggregate estimator
$\tilde a_n = \hat\theta_n^\top H$ of the parameter $a^*$, where $\hat\theta_n$ is obtained by the mirror averaging
algorithm, satisfies

$$E_{n-1}K(a^*,\tilde a_n) \le \min_{1\le j\le M}K(a^*,a_j) + \frac{\beta\log M}{n}. \tag{5.16}$$

Aggregation procedures can be used to construct pointwise adaptive lo- 
cally parametric estimators in nonparametric regression [2]. In this case 
inequality (5.16) can be applied to prove the corresponding adaptive risk 
bounds. We now check that Assumption 5.1 is satisfied for several standard 
parametric families. 

• Univariate Gaussian distribution. Let $\mu$ be the Lebesgue measure on $\mathbb{R}$
and let $p(x,a) = (\sigma\sqrt{2\pi})^{-1}\exp(-(x-a)^2/(2\sigma^2))$ be the univariate Gaussian
density with mean $a\in A = [-L,L]$ and known variance $\sigma^2 > 0$. Replacing
$f(x)$ by $a^*$ and $H(x)$ by $H$ in the proof of Corollary 5.6, and following
exactly the same argument as there, we find that Assumption 5.1
is satisfied for any $\beta\ge\beta_0 = 2\sigma^2 + 8L^2$. Hence, (5.16) also holds for such
$\beta$. Note that in this case $K(a^*,a) = (a^*-a)^2/(2\sigma^2)$.

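• Bernoulli distribution. Let $\mu$ be the discrete measure on $\{0,1\}$ such that
$\mu(0) = \mu(1) = 1$ and let $p(x,a) = a\mathbf{1}_{\{x=0\}} + (1-a)\mathbf{1}_{\{x=1\}}$ be the density
of a Bernoulli random variable with parameter $a\in A = (0,1)$. Then

$$\int\Bigl(\frac{p(x,H^\top\theta)}{p(x,H^\top\theta')}\Bigr)^{1/\beta}p(x,a)\,\mu(dx)
 = a\Bigl(\frac{H^\top\theta}{H^\top\theta'}\Bigr)^{1/\beta} + (1-a)\Bigl(\frac{1-H^\top\theta}{1-H^\top\theta'}\Bigr)^{1/\beta} \triangleq \Psi_\beta(\theta,\theta').$$

This function is concave in $\theta$ for any $\theta'\in\Theta$ if $\beta\ge 1$ and obviously
$\Psi_\beta(\theta,\theta) = 1$. Therefore Assumption 5.1 is satisfied and Corollary 5.8 applies
with $\beta = 1$.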

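• Poisson distribution. Let $\mu$ be the counting measure on the set of
nonnegative integers $\mathbb{N}$: $\mu(k) = 1$, $\forall k\in\mathbb{N}$, and let $p(x,a) = \sum_{k=0}^{\infty}\frac{a^k}{k!}e^{-a}\mathbf{1}_{\{x=k\}}$
be the density of a Poisson random variable with parameter $a\in A = [\ell,L]$
where $0 < \ell < L < \infty$. Then

$$\int\Bigl(\frac{p(x,H^\top\theta)}{p(x,H^\top\theta')}\Bigr)^{1/\beta}p(x,a)\,\mu(dx)
 = \exp\Bigl(a\Bigl[\Bigl(\frac{H^\top\theta}{H^\top\theta'}\Bigr)^{1/\beta} - 1\Bigr] - \frac{(\theta-\theta')^\top H}{\beta}\Bigr) \triangleq \Psi_\beta(\theta,\theta'). \tag{5.17}$$

Clearly, $\Psi_\beta(\theta,\theta) = 1$ and it is not hard to show that $\Psi_\beta(\theta,\theta')$ in (5.17) is
concave as a function of $\theta$ for any $\theta'\in\Theta$, provided that $\beta\ge 1 + L(1 +
L/\ell)(L/\ell)^{1/(2\ell+1)}$. Therefore Assumption 5.1 is satisfied and Corollary 5.8
applies with $\beta\ge\beta_0 = 1 + L(1 + L/\ell)(L/\ell)^{1/(2\ell+1)}$.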
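
The closed form in the Poisson example follows from the exponential series, and it can be checked numerically. The sketch below compares a truncated version of the sum $\sum_k (p(k,b)/p(k,b'))^{1/\beta}p(k,a)$ with $\exp(a[(b/b')^{1/\beta}-1] - (b-b')/\beta)$, where $b = H^\top\theta$ and $b' = H^\top\theta'$. The function names, the truncation level and the numerical values are hypothetical.

```python
import math

def poisson_pmf(k, a):
    """Poisson probability a^k e^{-a} / k!, computed on the log scale."""
    return math.exp(k * math.log(a) - a - math.lgamma(k + 1))

def lhs_by_summation(b, b_prime, a, beta, terms=200):
    """Truncated sum over k of (p(k, b) / p(k, b'))**(1/beta) * p(k, a)."""
    total = 0.0
    for k in range(terms):
        log_ratio = k * (math.log(b) - math.log(b_prime)) - (b - b_prime)
        total += math.exp(log_ratio / beta) * poisson_pmf(k, a)
    return total

def psi_closed_form(b, b_prime, a, beta):
    """exp(a * ((b / b')**(1/beta) - 1) - (b - b') / beta)."""
    return math.exp(a * ((b / b_prime) ** (1.0 / beta) - 1.0) - (b - b_prime) / beta)

# Hypothetical values of b = H^T theta, b' = H^T theta', the parameter a and beta.
b, b_prime, a, beta = 2.0, 1.5, 1.2, 3.0
gap = abs(lhs_by_summation(b, b_prime, a, beta) - psi_closed_form(b, b_prime, a, beta))
```

The two quantities agree up to floating-point error, and the closed form equals 1 whenever $b = b'$, in line with $\Psi_\beta(\theta,\theta) = 1$.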
6. Proof of Proposition 2.1. In view of (2.2), for any selector $T_n$ constructed
from the observations $Z_1,\dots,Z_n$ we have



Taking expectation on both sides of the previous inequality yields



Since $P_k$ is the product of $n$ multivariate Gaussian measures with means
$e_k(\sigma/2)\sqrt{\log(M)/n}$ and covariance matrices $\sigma^2I$, the Kullback-Leibler
divergence between $p_k$ and $p_1$ is given explicitly by $\mathcal{K}(p_k,p_1) = (\log M)/4$,
for any $k = 2,\dots,M$, where $p_k$ denotes the density of $P_k$. We can therefore
apply Proposition 2.3 in [25] with $\alpha^* = (\log M)/4$. Taking in that proposition
$\tau = 1/M$ we get (6.1) with some $c > 0$, which finishes the proof.
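
The value $(\log M)/4$ follows from the standard formula for the KL divergence between Gaussian product measures with a common covariance. The computation below is a short check. It assumes that the $e_j$ are the canonical basis vectors of $\mathbb{R}^M$ and that the reference measure is the one with index 1; the shorthand $m_j$ for the mean vectors is introduced here.

```latex
\begin{align*}
\mathcal{K}\bigl(N(m_k,\sigma^2 I)^{\otimes n},\,N(m_1,\sigma^2 I)^{\otimes n}\bigr)
  &= n\,\frac{\|m_k-m_1\|_2^2}{2\sigma^2},
  \qquad m_j = e_j\,\frac{\sigma}{2}\sqrt{\frac{\log M}{n}},\\
\|m_k-m_1\|_2^2 &= \frac{\sigma^2}{4}\,\frac{\log M}{n}\,\|e_k-e_1\|_2^2
  = \frac{\sigma^2\log M}{2n}\qquad (k\neq 1),\\
\mathcal{K} &= n\cdot\frac{1}{2\sigma^2}\cdot\frac{\sigma^2\log M}{2n} = \frac{\log M}{4}.
\end{align*}
```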